Abstract

Given some of the recent advances in Distributed
Hash Table (DHT) based Peer-to-Peer (P2P) systems,
we ask the following questions: Are there applications
where unstructured queries are still necessary (i.e., the
underlying queries do not efficiently map onto any
structured framework), and are there unstructured P2P
systems that can deliver the high bandwidth and computing
performance necessary to support such applications?
Toward this end, we consider an image search
application which supports queries based on image sim- 
ilarity metrics, such as color histogram intersection, 
and discuss why in this setting, standard DHT ap- 
proaches are not directly applicable. We then study the 
feasibility of implementing such an image search sys- 
tem on two different unstructured P2P systems: power- 
law topology with percolation search, and an optimized 
super-node topology using structured broadcasts. We 
examine the average and maximum values for node 
bandwidth, storage and processing requirements in the 
percolation and super-node models, and show that cur- 
rent high-end computers and high-speed links have suf- 
ficient resources to enable deployments of large-scale 
complex image search systems. 



1 Introduction 

The first widely known pure P2P system that tried 
to bring Napster-like functionality was the unstructured
P2P system Gnutella. The overlay network created
by Gnutella's peers forms a random graph, where
search was mostly done via complete flooding of the
network. Its imperfections have spawned extensive re- 
search. Much of this research is directed towards DHTs 
(e.g. [5, 12]), and distributed indexing structures de- 
rived from DHTs (e.g. [13]), super peer architectures 
(e.g. [15]) and approaches improving the link structure 
of P2P networks (e.g. [7]). 

DHTs excel at key-value lookup because that is the 
basic primitive of the hash table data structure. How- 
ever, hash tables are not the most efficient data struc- 
ture for all algorithms. In this work we are interested 
in storing images, which may be thought of as vec- 
tors of large dimension, and searching to find images 
which are close to a query image by some given met- 
ric. This work compares content based image retrieval 
(CBIR) in structured against unstructured P2P sys- 
tems. Unstructured systems can answer completely 
general queries, since there is no structure imposed on 
the data. We ask if sacrificing the flexibility of unstruc- 
tured systems for structured P2P systems results in a 
reduction of query bandwidth for the case of CBIR. 

Within this article, we treat content-based image re- 
trieval as an application scenario. We think that this 
scenario is interesting for P2P in several respects. (1) 
We believe that there is a need for such applications. 
The recent success of image blogs and image sharing 
servers like flickr.com has shown that people have the
wish to share and to publish their images. (2) Current 
widespread methods of indexing such images are unsat- 
isfactory. In flickr.com, the images are searchable 
by annotation. Unfortunately the annotation quality is 
low (as will be described in more depth below). Organizing
images by time [2] is useful when indexing one's
own collection where images can usually be grouped 
into images pertaining to events that are interesting to 
the user, but it becomes barely useful when considering 
collections that grow each second by at least one image. 
(3) Alternative methods are expensive. Sophisticated 
content-based image indexing methods that extract vi- 
sual features from the items to be queried need a large 
amount of processing power, and processing queries is 
expensive 1 compared to processing text queries. 

While CBIR is deemed unsatisfactory as a complete 
indexing solution for images, the conviction underlying 
this paper is that CBIR methods are going to be useful 
for improving flickr.com-like systems. Bad image annotation
is not going to go away. So, we seek systems
that help the user in the case that good annotation is 
not present. We are also convinced that people will 
be interested in combined rankings, which combine vi- 
sual aspects, surrounding text, etc. into one common 
measure. Unless one wants to restrict oneself to sys- 
tems that first evaluate similarity with respect to text 
and simple meta-data before refining the search using 
CBIR and other complex methods, one will need to 
come up with good methods that are able to process 
CBIR queries. 

This paper presents two findings. First, we find that 
unstructured systems are efficient enough in terms of 
communication and computational costs for the CBIR 
application we consider to scale to millions of users. 
Second, in the case of image similarity search, cur- 
rent structured systems offer no advantage over un- 
structured systems. This paper is organized as follows. 
Section 2 discusses prior work on the nearest-neighbor 
search problem in high dimensional spaces, a general 
version of the problem we consider in this work. Sec- 
tion 3 describes the particular case we are focusing on: 
image similarity search at the scale of approximately 
one million users. Section 4.1 studies the cost of im- 
plementing an image search system using a supernode 
architecture and Section 4.2 studies the cost for a per- 
colation search based architecture. Finally, in Section 5 
we compare the unstructured image search systems to 
a structured image search system and find that struc- 
tured search offers no benefit for this case. 

2 Nearest neighbor search in high dimension

The nearest-neighbor search problem is the following:
given a set of points $P$ in a metric space and a
query point $Q$, find an element $x$ of $P$ such that
$d(x, Q) \le d(y, Q)$ for all $y \in P$.

1 That means, they typically require many disk accesses, much
processing power and, as a consequence, much time to be
processed.

The classic way of performing CBIR is to extract
real-valued feature vectors from images and then map
the search problem to the problem of finding the $k$ nearest
neighbors ($k$-NN) to the query vector. For vectors of
dimension $d \lesssim 10$ there exist centralized data structures
that find the $k$-NN in $O(\log N)$ time for collections of
size $N$. However, the literature on non-distributed indexing
has observed [14] that exact search in high-dimensional
data is very hard, due to the so-called curse of dimensionality:
tree-based indexing structures break down in the sense that
in realistic scenarios $O(N)$ nodes need to be visited before
finding the exact $k$-NN. For non-distributed, disk-based
indexing structures, one interesting and well-known
solution [14] consists in accepting a full table scan and
making it more efficient. This algorithm exhibits $O(N)$
complexity, just like a tree-based indexing structure in
high dimensions; however, the absolute query duration
is reduced with respect to the tree-based solution.

Though not the subject of this work, one may
also consider the approximate version of the nearest-neighbor
search problem, which is the following: given a set
of points $P$ in a metric space, a query point
$Q$, and a slackness parameter $\epsilon$, find an element $x$ of $P$
such that $d(x, Q) \le (1 + \epsilon)\, d(y, Q)$ for all $y \in P$. An
efficient algorithm for the approximate nearest-neighbor
search problem is known for several common metric
spaces [4]. It is an interesting open problem to see if it
can be efficiently adapted to a distributed system.

Other proposals include using geometric dimension- 
ality reduction techniques. Kleis and Zhou provide a 
review of relevant results in the context of P2P net- 
works in [3]. 

2.1 P2P approaches to nearest-neighbor search

Clearly, finding the set $\{x \mid d(x, Q) < \delta\}$ for a given
$Q, \delta$ is an embarrassingly parallel problem. If one
spreads the database over a P2P network, one gets the
full benefit of parallelization. In the absence of an efficient
exact algorithm for the nearest-neighbor problem
in high dimension, this may be the best one can do.
Indeed, we consider this approach in Sections 4.1 and
4.2.
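As a minimal sketch of this embarrassingly parallel search, each peer can scan only its own vectors while the querier merges the partial results. The peer layout, the function names and the toy data below are illustrative assumptions and do not describe any particular system.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def local_range_search(local_vectors, query, delta):
    """Work done independently by one peer: a full scan of its own data."""
    return [x for x in local_vectors if euclidean(x, query) < delta]


def distributed_range_search(peers, query, delta):
    """Union of the per-peer results; no peer needs another peer's data."""
    hits = []
    for local_vectors in peers:  # in a real network these scans run in parallel
        hits.extend(local_range_search(local_vectors, query, delta))
    return hits


# Toy database spread over three hypothetical peers.
peers = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.1, 0.1), (5.0, 5.0)],
    [(0.2, 0.0)],
]
result = distributed_range_search(peers, query=(0.0, 0.0), delta=0.5)
```

The total work is still linear in the collection size, but it is divided among the peers that hold the data.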

In addition to the above, one can also attempt to use 
some structured P2P network to reduce the number 
of nodes that must be contacted to execute a query. 
We describe one such approach, PRISM, below. The
literature on P2P indexing using structured networks
is very broad. As one starting point for reading we
suggest [13].

2.1.1 PRISM 

PRISM indexes each vector x by placing x on a small 
number of nodes in a Chord DHT. The placement of 
the vector is calculated using distances to a fixed set of 
reference vectors. When processing a query, the node
issuing the query q calculates the set of nodes where q
would be placed and searches for similar vectors there,
sending these nodes q as the query. The main innovation of
PRISM is the algorithm for finding the nodes on which 
to place the data vectors. 

In order to index a vector $x$, the distance of $x$ to a
number $n_r$ of reference vectors $r_i$ is calculated, yielding
$\delta := (\delta_1, \ldots, \delta_{n_r}) := (\delta(x, r_1), \ldots, \delta(x, r_{n_r}))$. Then
the $r_i$ are ranked by their similarity. The result of this
ranking is a list of indices $\iota = (\iota_1, \ldots, \iota_{n_r})$ such that
$r_{\iota_1}$ is the reference vector closest to $x$, $r_{\iota_2}$ the second
closest and so on.

Then, pairs of indices are formed. The pair forma- 
tion is a fitting parameter, the original PRISM paper 
suggests {ti, ti}, {*,]., t 2 }, j>2, 13}, W, to}, W, t«i}, 

{i2, te}, {>2, t 4 }, {*3, t-4,} , W> ^5}, j>4, 15}, {>3, t 5 } for 
their dataset. From each of the pairs a Chord key is 
calculated, and this key is used for inserting the vector 
x into the Chord ring. 

As was hinted above, query processing works by 
finding out which peers would receive the query vec- 
tor if it was a new data item and forwarding the query 
vector to these peers. This involves, again, the calcu- 
lation of index pairs, which we will call query pairs in 
the following. In order to reduce query processing cost, 
the query processor can choose to contact only nodes 
pertaining to only a subset of the query pairs. Doing 
this also reduces recall, so there is a tradeoff. 
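The placement and query steps described above can be sketched as follows. This is only an illustration under stated assumptions: the pair template, the use of SHA-1 to derive a key, the 32-bit ring size and all function names are hypothetical choices made for this sketch and are not taken from the PRISM paper.

```python
import hashlib
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def rank_references(x, references):
    """Indices of the reference vectors, closest to x first."""
    return sorted(range(len(references)),
                  key=lambda i: euclidean(x, references[i]))


# ASSUMPTION: an illustrative pair template over rank positions (0 = closest).
# The pairs actually used by PRISM are a tuning choice and are not reproduced here.
PAIR_TEMPLATE = [(0, 1), (0, 2), (1, 2)]


def placement_keys(x, references, ring_bits=32):
    """One ring key per index pair; x would be stored under each key."""
    ranking = rank_references(x, references)
    keys = []
    for a, b in PAIR_TEMPLATE:
        pair = tuple(sorted((ranking[a], ranking[b])))  # pairs are unordered
        digest = hashlib.sha1(repr(pair).encode()).digest()
        keys.append(int.from_bytes(digest, "big") % (1 << ring_bits))
    return keys


def query_keys(q, references, max_pairs=None, ring_bits=32):
    """Keys a query is sent to; fewer pairs means lower cost but lower recall."""
    keys = placement_keys(q, references, ring_bits)
    return keys if max_pairs is None else keys[:max_pairs]


references = [(0, 0), (10, 0), (0, 10), (10, 10)]
stored = placement_keys((1.0, 0.0), references)
asked = query_keys((0.5, 0.2), references, max_pairs=2)
```

A query vector that ranks the reference vectors in the same order as a stored vector maps to the same keys, which is why nearby vectors tend to meet on the same peers.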

3 An image search system 

Within this section, we describe Flickr, a popular 
web-based photo sharing application. This application 
is currently immensely popular. At the same time, 
one could easily imagine extending its functionality to- 
wards content-based search. By examining Flickr we 
estimate the load for a P2P photo sharing system which 
we call Plickr. 

3.1 About Flickr 

flickr.com gives members the opportunity to share
photos among the public, friends and family. Flickr
members are allowed to comment on photos they can
see and to annotate them in a collaborative fashion.
Recently, flickr.com has experienced explosive
growth in popularity. As of the time of writing,
flickr.com contains about 40 million images, most of
them publicly accessible. We estimate that about 2
million users share images via flickr.com.

Figure 1. Example images tagged with the
annotation tag phone, taken from flickr.com
with permission from Katy Wortman.

One of the reasons for our interest in flickr.com
is that it has a SOAP-like API that allows easy access.
This enormously simplifies building third-party tools,
as well as crawlers. As the user structure is quite similar
to what many people would like to have in P2P file
sharing systems (people peacefully sharing data they
actually own), we simply extrapolate from user behavior
on flickr.com to the behavior they would have in
a P2P network.

Using flickr.com, we do not obtain data about the
online times of the users. However, information about 
who shares how much is already useful. 

One of the most interesting features of Flickr is
that members can annotate other people's images.
This leads to the surprising fact that most of Flickr's
images are annotated. However, as it is made simple to
add annotation to images by default (on a user-by-user
basis), the quality of the annotation varies. For example,
many Flickr members have a large fraction of
their photos annotated with the tag phone. This tag 
describes how the image came to Flickr, by a camera 
built into a cellular phone. However, it is only rarely an 
accurate description of the images' content, as the ex- 
amples in Fig. 1 show. While the tag phone is clearly 
the most extreme case of annotation that carries lit- 
tle valuable information, it shows that just citing the 
number of images that are annotated does not permit 
assessing the usefulness of this annotation. 

3.2 Plickr: content-based Flickr over P2P 
as a scenario for P2P-CBIR evaluation 

The above ad-hoc assessment of annotation quality
motivates the view that it would be interesting to
combine the search by annotation tag (as offered by
Flickr) with search based on image similarity as provided
by Content Based Image Retrieval systems (CBIRS).
While CBIRS are unable to do true object recognition, 
they capture visual similarity by translating each image 
into a data representation that captures mainly image 
statistics and in some cases the spatial relation of "in- 
teresting" regions. While using such features as the 
sole image retrieval method is currently deemed unsat- 
isfactory, CBIRS are still the method of choice when 
no annotation is present. 

Content Based Image Retrieval is computationally 
costly. Features need to be extracted, and similarity 
search is much harder than similarity search on text 
because the number of features taken into account for 
obtaining a retrieval result is usually much higher than 
the number of keywords in a keyword query. For more 
information about CBIR we point to an often-cited 
overview paper [10]. 

P2P becomes particularly interesting through the 
fact that P2P-CBIR potentially will make use of both 
the huge storage capacity and the huge computing ca- 
pacity distributed in the network. 

Let us consider a P2P network, in which each peer 
owner shares his or her own photos with other users 
of the same P2P network. The network would of- 
fer Flickr's functionality plus CBIR query by example 
functionality. However, in contrast to Flickr, its inner 
workings would be based entirely on P2P principles. 
We will call this hypothetical network the plickr network
within this paper.

3.3 Deriving a load scenario for plickr 

We assume the images contained in one Flickr user 
account to be a good model for the images shared by 
one plickr peer. Let us assume 1,000,000 users and 
thus 1,000,000 peers in our plickr network. Our Flickr
crawls ($\approx$ 2,000,000 images) indicate that a
user who shares at least one image publicly shares on
average 20 images.

As on Flickr, plickr members query the data
collection for interesting images from time to time.
Furthermore, we assume each user performs 10 queries
on average per day. We feel that is reasonable as querying
images is an exploratory process, so each querier is
likely to perform a query process consisting of multiple 
queries. So, even if such a query process is performed 
less than once per day by each user, we are likely to 
reach the said average load. 

In this paper we will concentrate on a very simple 
way of performing CBIR: retrieval by color histograms. 
They are known to provide good retrieval performance
(i.e. result quality) for comparatively little computing
power and are the favorite feature extraction method
for indexing structure evaluations.

Color histograms are obtained by cutting the color 
space in regions. The color histogram then is a vec- 
tor that contains one value corresponding to each color 
space region. Each value of the histogram expresses 
for the corresponding color region the probability that 
a pixel drawn from the image falls into the color region. 
This probability is estimated by simply counting pix- 
els falling in each color region. The usefulness of color 
histograms for CBIR depends on the color space cho- 
sen and the way it is split into regions. For our load 
assumption, we assume John R. Smith's 166-D HSV 
histograms described in [11]. A wasteful but simple
representation would be of type float[166].
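To make the counting procedure concrete, here is a minimal sketch of a 166-bin HSV histogram. The bin layout (18 hues x 3 saturations x 3 values plus 4 gray levels), the uniform quantization and the gray threshold are assumptions made for this example; the exact quantization in [11] may differ.

```python
import colorsys

# ASSUMED layout: 18 * 3 * 3 chromatic bins + 4 gray bins = 166 bins.
HUES, SATS, VALS, GRAYS = 18, 3, 3, 4
N_BINS = HUES * SATS * VALS + GRAYS


def bin_index(r, g, b, gray_threshold=0.1):
    """Map an 8-bit RGB pixel to one of the 166 color-space regions."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    if s < gray_threshold:  # nearly colorless pixels go to the gray bins
        return HUES * SATS * VALS + min(int(v * GRAYS), GRAYS - 1)
    hi = min(int(h * HUES), HUES - 1)
    si = min(int(s * SATS), SATS - 1)
    vi = min(int(v * VALS), VALS - 1)
    return (hi * SATS + si) * VALS + vi


def color_histogram(pixels):
    """Estimate, per region, the probability that a pixel falls into it."""
    hist = [0.0] * N_BINS
    count = 0
    for r, g, b in pixels:
        hist[bin_index(r, g, b)] += 1.0
        count += 1
    return [c / count for c in hist] if count else hist


# Toy "image": two red pixels, one mid-gray pixel, one black pixel.
hist = color_histogram([(255, 0, 0), (255, 0, 0), (128, 128, 128), (0, 0, 0)])
```

Each entry is a relative pixel count, so the entries of a non-empty image sum to one.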

The classic way of evaluating the similarity of two
histograms is the histogram intersection; however, the
testing ground for most CBIR indexing algorithms is the
Euclidean distance, which is why we focus on that distance
measure.
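The two measures mentioned above can be written in a few lines. This is a generic sketch for normalized histograms (entries summing to one); the function names and toy vectors are illustrative only.

```python
import math


def histogram_intersection(h1, h2):
    """Similarity in [0, 1] for normalized histograms; 1 means identical."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def euclidean_distance(h1, h2):
    """Distance measure used in the remainder of the paper; 0 means identical."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))


h1 = [0.5, 0.5, 0.0]
h2 = [0.5, 0.0, 0.5]
similarity = histogram_intersection(h1, h2)
distance = euclidean_distance(h1, h2)
```

Note that the intersection is a similarity (larger is closer) while the Euclidean distance is a dissimilarity (smaller is closer).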

4 Search with unstructured P2P 

Unstructured P2P systems have one major advan- 
tage over structured systems: once designed, imple- 
mented and deployed an unstructured system can gen- 
erally be used for any kind of query just by plugging 
in new query processing. There is no need to define 
new routing algorithms, network topologies, or caching 
strategies every time a new type of data or query is 
introduced to the network. On the other hand, struc- 
tured P2P systems may reduce search complexity only 
if the search algorithm can be efficiently mapped onto 
the topology of the structured network. 

In this section, we compute the costs in bandwidth, 
computational resources, and storage to use unstruc- 
tured P2P for two models: a super-node system, and 
a percolation search system. After computing the 
costs of the system, in order to estimate feasibility, 
we make some assumptions about the usage of the 
image search system described in Section 3. We assume
that each content item is a float[166] array,
which uses $166 \times 4 = 664$ bytes of space. As we mentioned
in Section 3.3, every Flickr user inserts on average
$C = 20$ items into the network. Calculating the
distance takes $f = 332$ floating point operations. Finally,
we will assume that there are $N = 2^{19} \approx 500{,}000$
users, and that each user will make 10 queries per day,
or $R = \frac{10}{24 \times 60 \times 60} \approx 1.2 \times 10^{-4}$ queries per second. We
assume each query and content item requires $z = 800$ bytes
(enough to hold the float vector and some routing information
or image meta-data).

Table 1. Nomenclature used throughout the paper

Symbol                Meaning
$N$                   Total number of peers
$\langle k \rangle$   Expected degree of nodes within network
$p_m$                 Relative frequency of nodes with degree $m$
$s$                   Fraction of super peers
$R$                   Query rate issued per peer ($1.2 \times 10^{-4}\,\mathrm{s}^{-1}$)
$C$                   Number of content items contributed per peer (20)
$f$                   Number of operations to compare two vectors (332 FlOp)
$z$                   Size of each vector (800 B)
$B_{max,ave}$         max/avg bandwidth required per peer
$P_{max,ave}$         max/avg processing required per peer
$D_{max,ave}$         max/avg disk space required per peer



4.1 Super-node P2P networks

In the super-node model there are two types of 
nodes: leaf nodes and super-nodes. Each leaf node 
connects to a super-node and caches all its content on 
that super-node (the super-node does not need to cache 
its own content). The leaf nodes require very little re- 
sources, but super-nodes incur the maximum penalty. 
The only parameter in the system is s, the fraction of 
nodes which are super-nodes. 

4.1.1 Resource requirements 

To compute average bandwidth, we count total copies 
of the queries and divide by the total number of nodes. 
We should note that this metric is not very meaningful 
since no nodes see the average. Leaf nodes see almost 
no traffic, while super-nodes see the maximum traffic. 

Since each query is copied to $sN$ super-nodes plus
the leaf node that initiated the query, the average bandwidth
is clearly $B_{ave} = RN \frac{(sN + 1) z}{N} = RNz(s + \frac{1}{N}) \approx RzsN$.
All the super-nodes see the same bandwidth
since all queries pass through them. If we assume that
the query crosses each edge in the multicast tree (using
an approach similar to [6]) then it is necessary
and sufficient for the maximum degree to be 3, thus
$B_{max} = 3RzN$, which is independent of $s$.

Since there are $CN$ total content items, the average
disk space is $D_{ave} = CNz/N = Cz$. Since all content
is stored on the super-nodes, the maximum disk space
is $D_{max} = CNz/(Ns) = Cz/s$.

In the super-node system, each content item is
copied (at most) one time. The average processing
requirement is not very meaningful since, like average
bandwidth, no node experiences this load. Nodes either
see almost no load, or maximum load. We assume
a linear complexity for search, so that $P = RD(f/z)N$:

$P_{ave} = R D_{ave} (f/z) N = RCfN$.

Since all the queries are processed by the super-nodes,
we only need to compute the number of content
items on each super-node, and then multiply by
the query rate: $P_{max} = R D_{max} (f/z) N = \frac{C}{s} f R N = \frac{RCfN}{s}$.

4.1.2 Trade-offs and numerical values 

One interesting feature of this model is that
$P_{max} B_{ave} = fzCR^2N^2$, which is independent of $s$.
So, there is a trade-off between average bandwidth and
maximum CPU utilization. In the interest of considering
some numerical values, we will set $s = 1/\sqrt{N}$,
which means each super-node has as many leaf nodes
as there are super-nodes. In practice, one will probably 
prefer to minimize bandwidth to the extent that it is 
possible for the super-nodes to handle the load. 

In the following table, we present performance metrics
for the super-node algorithm with $s = 1/\sqrt{N}$,
$N = 2^{19}$ and the values of $R$, $C$, $f$, $z$ from Table 1.

$B_{ave} = Rz\sqrt{N}$       = 70 B/s = 560 bps
$B_{max} = 3RzN$             = 150,000 B/s = 1.2 Mbps
$D_{ave} = Cz$               = 16 KB
$D_{max} = Cz\sqrt{N}$       = 11 MB
$P_{ave} = RCfN$             = 420 kFlOp/s
$P_{max} = RCfN\sqrt{N}$     = 300 MFlOp/s

The values look reasonable. Since modern CPUs 
have processing power on the order of 4 GFLOPS, the 
above processing requirements are not more than one 
CPU. The figure we might be most concerned about 
is Bmax, however that value is independent of s. Now 
we compare the above with the percolation search al- 
gorithm for unstructured networks. 
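The super-node figures can be re-derived from the closed-form expressions of Section 4.1.1. The sketch below simply evaluates those expressions; the function name and the dictionary layout are illustrative choices, and the rounded query rate $R = 1.2 \times 10^{-4}$ from Table 1 is used.

```python
import math


def supernode_costs(N, s, R, C, f, z):
    """Evaluate the super-node cost formulas (bytes/s, bytes, FlOp/s)."""
    return {
        "B_ave": R * N * z * (s + 1.0 / N),  # average bandwidth per peer
        "B_max": 3 * R * z * N,              # bandwidth of a super-node
        "D_ave": C * z,                      # average storage per peer
        "D_max": C * z / s,                  # storage of a super-node
        "P_ave": R * C * f * N,              # average processing per peer
        "P_max": R * C * f * N / s,          # processing of a super-node
    }


N = 2 ** 19
costs = supernode_costs(N, s=1 / math.sqrt(N), R=1.2e-4, C=20, f=332, z=800)
for name, value in costs.items():
    print(f"{name}: {value:,.0f}")
```

Changing `s` in this sketch shows the trade-off directly: a smaller fraction of super-nodes lowers the average bandwidth but raises the storage and processing load on each super-node.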

4.2 Percolation search in power-law networks

In [9] and subsequent work, the authors show 
that using a combined random walk data replica- 
tion/random walk query distribution scheme (to be de- 
tailed below) one can achieve sub-linear (in N) query 
complexity in power-law networks. Below we summa- 
rize this algorithm. 

The degree distribution $p_k$ of the network describes
the probability to draw a node with degree $k$ from the
network: $p_k = A k^{-\tau}$, where $A$ is a normalization constant
such that $\sum_{k \le k_{max}} p_k = 1$. The main result of [9]
is the following three-step algorithm:

Step 1, Content List Implantation: To insert cached replicas
of the content, a random walk is performed and each peer visited
during this random walk receives a copy of the index
data. For $\tau = 2$, the length of the random walk should
be $O(\log N)$.

Step 2, Query Implantation: The above
process is performed for each query.

Step 3, Bond percolation: After the query implantation,
each node that has received a query so far will forward
the query to each of its neighbors with a probability
$q = \gamma q_c$, where $q_c = \langle k \rangle / (\langle k^2 \rangle - \langle k \rangle)$
is the percolation threshold and $\gamma$
is a small number greater than unity.

For random power-law networks, there are two parameters
one may control: $\tau$, the exponent of the power law,
and $k_{max}$, the maximum number of neighbors any
node has. In this work we will only consider $\tau = 2$.
For $\tau = 2$, $1/A = \sum_{k \le k_{max}} k^{-2} \approx 1.6$. Next we
consider the scaling of the average and maximum of
bandwidth, disk and processing requirements for the
percolation search.
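The three steps of the percolation search can be sketched on a small graph as follows. This is a toy illustration, not the implementation of [9]: the adjacency-list representation, the function names, the ring-shaped demo graph and the fixed seed are assumptions made for this example.

```python
import random


def percolation_threshold(degrees):
    """q_c = <k> / (<k^2> - <k>) for a given degree sequence."""
    mean_k = sum(degrees) / len(degrees)
    mean_k2 = sum(k * k for k in degrees) / len(degrees)
    return mean_k / (mean_k2 - mean_k)


def random_walk(graph, start, length, rng):
    """Nodes visited by a simple random walk (used for both implantations)."""
    visited, node = {start}, start
    for _ in range(length):
        node = rng.choice(graph[node])
        visited.add(node)
    return visited


def bond_percolation(graph, seeds, q, rng):
    """Forward the query over each edge independently with probability q."""
    reached, frontier, tried = set(seeds), list(seeds), set()
    while frontier:
        node = frontier.pop()
        for nb in graph[node]:
            edge = (min(node, nb), max(node, nb))
            if edge in tried:  # each undirected edge is sampled only once
                continue
            tried.add(edge)
            if rng.random() < q and nb not in reached:
                reached.add(nb)
                frontier.append(nb)
    return reached


def percolation_search(graph, owner, querier, walk_len, q, rng):
    """True if the query reaches at least one node caching the content."""
    caches = random_walk(graph, owner, walk_len, rng)       # step 1
    implanted = random_walk(graph, querier, walk_len, rng)  # step 2
    reached = bond_percolation(graph, implanted, q, rng)    # step 3
    return bool(caches & reached)


# Demo on a 10-node ring (not a power-law graph; chosen only for brevity).
ring = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
rng = random.Random(0)
found = percolation_search(ring, owner=0, querier=5, walk_len=3, q=1.0, rng=rng)
```

On a power-law graph the high-degree nodes are likely to be visited by both random walks, which is what allows a forwarding probability only slightly above the threshold to succeed.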



4.2.1 Resource requirements 

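Given that a node receives a query, with probability $q$,
each neighbor sees that query. So, the total number of
edges to see a query will be $qE$, where $E$ is the total
number of edges. Since $E = \langle k \rangle N/2$, and $q \propto q_c =
\langle k \rangle/(\langle k^2 \rangle - \langle k \rangle) \approx \langle k \rangle/\langle k^2 \rangle$, we have that the average
bandwidth cost is

$B_{ave} = \frac{RNEqz}{N} = Rz \frac{\langle k \rangle^2 N}{2 \langle k^2 \rangle}$.

When $p_k = A/k^2$, $\langle k \rangle \approx A \ln k_{max}$ and $\langle k^2 \rangle \approx A k_{max}$, so we
have $B_{ave} \approx RzN \frac{A \ln^2 k_{max}}{2 k_{max}}$.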

To compute the maximum bandwidth, we need to
look at the highest degree node and see how many of
its neighbors will see the query. The highest degree
node has $k_{max}$ neighbors, and on average $q k_{max}$ will see
the query, thus: $B_{max} = RzN k_{max} q \approx RzN k_{max} q_c \approx
RzN \ln k_{max}$.

For a power-law random network with exponent $\tau = 2$,
the content is cached on $\log_2 N$ nodes. Thus, the
average storage requirements are
$D_{ave} = \frac{CzN \log_2 N}{N} = Cz \log_2 N = \frac{Cz \ln N}{\ln 2}$.

Computing the storage required for the maximum
node is more involved. We model a random walk on
a random network as each step selecting a random
node of degree $m$ with probability $m p_m/\langle k \rangle$. The
probability of selecting a node of the highest degree is
$P_s = \frac{k_{max} p_{k_{max}}}{\langle k \rangle} = (k_{max} \ln k_{max})^{-1}$. We assume that we
select each of the nodes of degree $k_{max}$ with equal probability,
so the probability we select each one of them is
$Q_s = P_s (N p_{k_{max}})^{-1}$. There are $\log_2 N$ steps, so the
number of content caches that make it to highest degree
nodes is $F_{max} = Q_s \log_2 N = \frac{k_{max} \ln N}{\ln 2 \, A N \ln k_{max}}$. Now
we can compute the maximum storage requirements:

$D_{max} = \frac{Cz\, k_{max} \ln N}{A \ln 2 \ln k_{max}}$.

As before, we assume that the search time is linear
in the number of items stored at each node. Since we
have already computed the number of items stored at
each node, we have:

$P_{ave} = RN D_{ave} (f/z) = RCfN \frac{\ln N}{\ln 2}$

$P_{max} = RN D_{max} (f/z) = RCfN \frac{k_{max} \ln N}{A \ln 2 \ln k_{max}}$

To reduce $P_{max}$ we need to reduce $k_{max}$, but that
will increase $B_{ave}$.
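The closed-form expressions of this section can be evaluated for different values of $k_{max}$ to see the trade-off. The sketch below is a direct transcription of the approximations derived above. Two points are assumptions of the sketch: the normalization constant $A$ is obtained by summing $k^{-2}$ from degree 1 up to $k_{max}$, and the function names are hypothetical.

```python
import math


def normalization(k_max):
    """A such that sum_{k=1..k_max} A / k^2 = 1 (ASSUMES degrees start at 1)."""
    return 1.0 / sum(1.0 / (k * k) for k in range(1, k_max + 1))


def percolation_costs(N, k_max, R, C, f, z):
    """Evaluate the percolation-search cost approximations for tau = 2."""
    A = normalization(k_max)
    ln_k, ln_N, ln_2 = math.log(k_max), math.log(N), math.log(2)
    d_ave = C * z * ln_N / ln_2
    d_max = C * z * k_max * ln_N / (A * ln_2 * ln_k)
    return {
        "B_ave": R * z * N * A * ln_k ** 2 / (2 * k_max),
        "B_max": R * z * N * ln_k,
        "D_ave": d_ave,
        "D_max": d_max,
        "P_ave": R * N * d_ave * f / z,
        "P_max": R * N * d_max * f / z,
    }


N = 2 ** 19
large = percolation_costs(N, k_max=math.isqrt(N), R=1.2e-4, C=20, f=332, z=800)
small = percolation_costs(N, k_max=100, R=1.2e-4, C=20, f=332, z=800)
```

Comparing `small` with `large` illustrates the statement above: lowering $k_{max}$ reduces the load on the busiest node but increases the average bandwidth.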

4.2.2 Trade-offs and numerical values

In the percolation search, like the super-node system,
we can decrease the average bandwidth required at
the expense of increasing the maximum processor utilization.
Using the percolation search, $B_{ave} P_{max} =
fzCR^2N^2 \frac{\ln k_{max} \ln N}{2 \ln 2}$, which is similar to the super-node
architecture except with some logarithmic factors.
In order to minimize average bandwidth, we should
choose $k_{max}$ to be as large as possible. In general,
the percolation search algorithm behaves like the ideal
super-node algorithm with $s \approx 1/k_{max}$.

To compare to the super-node case, we choose
$k_{max} = \sqrt{N}$, $N = 2^{19}$ and the values of $R$, $C$, $f$, $z$ from
Table 1, using the same constants we assumed in
Section 4.1.2, which means our highest degrees are comparable
to super-nodes. The results for this case are summarized
in the following table:

$B_{ave} = \frac{A}{8} Rz\sqrt{N} \ln N$             $\approx$ 70 B/s = 560 bps
$B_{max} = \frac{1}{2} RzN \ln N$                    $\approx$ 330 KB/s = 2.7 Mbps
$D_{ave} = Cz \frac{\ln N}{\ln 2}$                   $\approx$ 300 KB
$D_{max} = Cz\sqrt{N} \frac{2}{A \ln 2}$             $\approx$ 52 MB
$P_{ave} = RCfN \frac{\ln N}{\ln 2}$                 $\approx$ 7.9 MFlOp/s
$P_{max} = RCfN\sqrt{N} \frac{2}{A \ln 2}$           $\approx$ 1.4 GFlOp/s

For $D_{max}$ and hence $P_{max}$ there is a constant factor
overhead of $2/(A \ln 2) \approx 4.6$ when compared to the
ideal super-node case. For all other metrics, there is
an $O(\ln N)$ overhead for using the percolation search;
however, for networks of size $N = 2^{19}$, due to the division
by a constant, the difference is not very great.

4.2.3 Simulation results 

Our simulations use the Netmodeler package [1]. We
insert 1000 content objects at uniformly selected nodes
on a power-law network with $\tau = 2$, and then make
1000 queries from uniformly selected nodes. We are
particularly concerned with the maximum demands
made on any node. Some results are in the following
table. BW is the total number of times each query
is copied to search the entire network, and Max C/O
is the average of the maximum CPU cost per content
object in the network, where 1 is the cost to evaluate
the query against one content object.



Nodes      q      ttl   Hit-rate   BW       Max C/O
$2^{19}$   0.01   20    0.961      10,428   0.0075
$2^{20}$   0.01   21    0.966      21,045   0.0042



We see that for the case of 1000 content objects,
the maximum node had to search 7.5 and 4.2 objects
for each query on average, for the cases of $2^{19}$
and $2^{20}$ nodes respectively. We now scale these results up
to our assumption of 20 content objects per node.
Additionally, the total query byte rate will be
$zNR = 800 \times 2^{19} \times 1.2 \times 10^{-4} \approx 48{,}500$ B/s $= Q$.
For $N = 2^{19}$ we have:



B a 
p 



10,428 



13,280 x 2 



= 7.7Kbps 
7.5 



19 
The above parameters are a relatively close match to
the predictions of the previous section. For $N = 2^{20}$
we have $Q' = 2Q$.

The other parameters such as $D_{ave}$, $P_{ave}$ and $B_{max}$ are
not dependent on the nonlinearities of percolation, and
as such match predictions of the previous section.

Our simulations verify that the average bandwidth 
required is much less than analog modems can provide 
and the maximum processing requirements are met by 
one modern desktop CPU. 

5 Comparison of unstructured to structured image search

In order to compare unstructured search with cur- 
rent DHT-based approaches, we took PRISM [8] as a 
base for comparison. PRISM is a recent system with a 
clear focus on similarity queries over high-dimensional 
vectors. 

The PRISM paper also describes load balancing between
PRISM peers. However, within the following, we
will consider PRISM without load balancing, as perfect
load balancing would amount to all peers carrying the
average load.

Figure 2. The fraction of 20-NN found plotted
against the fraction of the total collection visited
in "Vanilla" PRISM.

Figure 3. The fraction of 20-NN found plotted
against the load on the most solicited peer in
"Vanilla" PRISM without load balancing.

The performance metrics for the vanilla PRISM
algorithm without load balancing are given in
the table below. We chose visiting all 11 reference
pairs for our calculation. Again $N = 2^{19}$ and the values
of $R$, $C$, $f$, $z$ are taken from Table 1.

$B_{ave}$   = 2.02 B/s = 16.1 bps
$B_{max}$   = 12,600 B/s = 100 kbps
$D_{ave}$   = 176 KB
$D_{max}$   $\approx$ 840 MB
$P_{ave}$   = 670 kFlOp/s
-- hhGFlOp/s 
Figs. 2 and 3, as well as Tab. 5, present our experiments
with a simulation of PRISM. Without load
balancing, PRISM behaves to quite an extent like a
client/server system: most load hits few servers. Little
load is distributed, so there is low communication cost.
For Euclidean distance, PRISM presents — if any — 
only small advantages over the super-peer network. In 
our current setup, in order to find 75% of the top 20 
documents, we have to visit each vector more than one 
time on average, i.e. PRISM performs worse than ran- 
dom search. If one wants to push the recall to 100% 
(as good as the super peer method), each item is con- 
sidered even more times on average, incurring a clear 
efficiency penalty with respect to a full scan. 

The second finding is that the distribution of data 
items over peers is heavily skewed, emphasizing the 
need for load balancing as proposed in [8] . Our experi- 
ment used 32 x 32 = 1024 pairs. In these experiments, 
the first 5 most used pairs account for more than 10% of 
the traffic, the first 15 pairs account for more than 25% 
of the traffic, and the first 60 pairs account for more 
than 50% of the traffic. To highlight this fact, Tab. 5 
shows PRISM without load balancing. Here, PRISM 
functions almost in a client/server-like fashion. Please
note that with the proper use of load balancing, PRISM
would thus behave much like the super-peer network discussed
in Section 4.1, but with slightly higher load for
the super-peers.

The third finding, finally, should spawn a series of 
new experiments: The performance of systems like 
PRISM depends also on data set and distance mea- 
sure. However, most of the distributed indexing lit- 
erature is fixated on the Euclidean metric and similar 
distance measures. Our experiments show that it is 
clearly worthwhile to investigate deeper into the per- 
formance of such systems when using more diverse dis- 
tance measures. 

6 Conclusion 

Summarizing, the performance of structured and
unstructured systems seems to be quite close in our
application domain, while unstructured systems have
the advantage of being more flexible with respect to
the queries they allow. We should mention that this
conclusion is similar to that of the recent paper [16].